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Components  of  SPACE-LRC 


■  GEMS  (Purdue  University):  Unstructured  near-body  solver 

■  CASTLES  (AFRL-West):  High-order  Cartesian  off-body  solver 

■  Kokkos  is  integrated  into  CASTLES 

■  PUNDIT  (CREATE-AV):  Mesh  communication  between  GEMS  and  CASTLES 

■  SAMRAI  (LLNL):  Adaptive  meshing  for  off-body 
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Structure  of  CASTLES 


Control  API 


Timestepping 

Time  derivatives  for  physical  quantities 


Geometry 

Handles  spatial  discretization 


System 

Specifies  system  of  equations 


Equations 

Physics-independent  quantities 


Physics 

Turbulence  models 
Detailed  chemical  kinetics 
Chung  Viscosity  Model  (ported  to  Kokkos) 
Peng-Robinson  Equation  of  State  (ported  to  Kokkos) 
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What  is  Kokkos? 


"Performant  cross  platform  parallelism":  write  once,  compile  for  anything. 

Parallel  patterns  (for,  reduce,  scan)  accept  user-defined  functors  (like  Thrust  or  Intel  TBB) 

Backends  for  Nvidia  GPU,  Intel  Xeon,  Xeon  Phi,  IBM  Power8,  others. 

"View"  data  structure  provides  optimal  layout: 

cache-order  access  when  compiled  for  CPU,  coalesced  access  when  compiled  for  GPU. 

Thrust  offers  similar  multi-platform  backends  -  but  less  low  level  control  and  does  not  abstract  data  layout. 
Programming  Guide: 

https://github.com/kokkos/kokkos/blob/master/doc/Kokkos  PG.pdf 

At  GTC  2017: 

S7344  -  Kokkos  :  The  C++  Performance  Portability  Programming  Model 
S7253  -  Kokkos  Hierarchical  Task-Data  Parallelism  for  C++  HPC  Applications 
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Enabling  Kokkos  in  CASTLES 


CASTLES  is  a  Cartesian  solver  written  in  Fortran  90. 

■  Identify  performance  limiting  subroutines 

■  Port  Fortran  subroutines  to  Kokkos  C++ 

■  Optimize  ported  routines 

■  Minimally  invasive  integration  of  Kokkos  C++  with  CASTLES 
("code  surgery") 
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Identify  critical  subroutines  -  CPU  profile 


Quick  and  easy  single-process  profile  with  nvprof: 


nvprof  — cpu-prof iling  on 

--cpu-profiling-mode  top-down  . /CASTLES. x 

I  like  the  top-down  view... 

easy  to  see  global  structure  and  call  chains. 

Can  also  do  bottom  up  profile  (default) 
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Identify  critical  subroutines  -  CPU  profile 


Quick  and  easy  single-process  profile  with  nvprof: 

nvprof  — cpu-prof iling  on 

--cpu-profiling-mode  top-down  . /CASTLES. x 

I  like  the  top-down  view- 

easy  to  see  global  structure  and  call  chains. 

Can  also  do  bottom  up  profile  (default) 

Looks  like  those  "preos"  and  "chung"  routines 
are  burning  a  lot  of  CPU  time 


========  CPU  profiling  result  (top  down) : 

51.29%  clone 
|  51.29%  start_thread 

|  51.29%  orte_progress_thread_engine 

|  51.2  9%  opal_libevent2021_event__base_loop 

|  51.29%  poll_dispatch 

|  51.29%  poll 

48.54%  MAIN. _ 

|  48.45%  interf acetime_mp_maintimeexplicit_ 

|  |  48.45%  interfacetime_mp_rhstimessp34_ 

29.77%  interf acegeom_mp_rhsgeomrescalc_ 

I  15.46%  interf acegeom_mp_rhsgeom3dresadllr_ 

I  |  15.35%  interf acesysexternal_mp_rhssysupdiss__ 

I  |  |  15.35%  interf acesysinternal_mp_rhssysscalarupdiss_ 

I  |  |  9.85%  eosmodule_mp_eoscalcrhohOf romtp_ 

III  I  9.64%  eosmodule_mp_eosrhohf romtpprop_ 

III  |  9.64%  preosmodule_mp_preosrhohf romtpprop_ 

I  |  |  5.18%  eosmodule_mp_e os gamma jacobianproperties_ 

I  |  |  5.10%  preosmodule_mp_preosgamma jacobianpropert ies_ 

|  13.90%  interf acegeom_mp_r hsgeom3dvi seres 2_ 

I  |  13.84%  interf acesysexternal_mp_rhssysviscflux_ 

|  |  13.32%  preosmodule_mp_preosviscousf luxproperties_ 

I  |  |  7.85%  chungtransmodule_mp_chungcalctransprop_ 

|  |  |  3.27%  preosmodule_mp_preoscriticalstate__ 

18 . 33%  interf acegeom_mp_bcgeomrescalc_ 

|  14.77%  interf acegeom_mp_bcgeomsubin_ 

|  |  14.77%  interf aceeqnf luids_mp_bcf luidseqnsubin_velocity__ 

|  |  14.77%  preosmodule_mp_preoscalct f romhp_ 

|  3.56%  interf acesysexternal_mp_stepsys3dcalcqadd_ 

|  3.53%  eosmodule_mp_eosthermalproperties_ 

|  |  3.50%  preosmodule_mp_preosthermalproperties_ 
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Peng-Robinson  equation  of  state  and  Chung  transport  model 


Peng-Robinson  Equation  of  State: 

Computes  physical  properties  (density,  enthalpy,  etc.)  for  real  gas 
mixtures  at  high  pressure 

Chung  Transport  Model: 

Computes  transport  properties  (viscosity,  thermal  conductivity, 
mass  diffusivity)  for  real  gas  mixtures  at  high  pressure 

Many  underlying  subroutines  shared  between  Chung  and  P-R. 

Properties  are  computed  individually  per  cell 
(or  interpolated  points  at  cell  interfaces), 

so  trivially  parallel 

Relatively  small  data  transfer,  lengthy  computation 
=>  perfect  for  GPU  offload 

Input/output  data  scales  linearly  with  number  of  species  (NS) 

Subroutines  contain  single  loops,  double  loops,  triple  loops  over  NS 
=>  runtime  scales  like  a*NS  +  b*NS2  +  c*NS3 

Occupies  significant  majority  of  CASTLES  runtime  for  ns  >=  4ish 


Cubic  polynomial  fits  P-R  scaling 
with  number  of  chemical  species 


Runtime  on  GPU  - Poly.  (Runtime  on  GPU) 


o 

a 


(JO 

01 
a. 

i/i 

-o 
c 
o 

u 
0) 
to 

O.OOE+OO 

5  8  11  14  17  20  23  26  29  32  35  38  41  44  47  50 

Number  of  species  (NS) 


2.50E-05 

y  =  2E-10x3  +  3E-09x2  +  2E-08x  +  6E-09 
2.00E-05  R2  =  0.9999 

1.50E-05 

1.00E-05 

5.00E-06 
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Architecture  of  my  Kokkos  framework 

Designed  for  minimally-invasive  operation  alongside  large  Fortran  code. 


Frame 

//  Owns  and  allocates  TVProperties  object 
TVProperties*  tvproperties; 

//Controls  Kokkos  initialization/finalization 
void  initialize^..); 
void  finalize(...); 

TVProperties*  gettvpropertiesQ; 


Everything  is  controlled  from 
Fortran  through  a  single 
lightweight  global  Frame  object. 

Kernel  launches  and  data  comms 
are  referred  to  TVProperties* 
owned  by  Frame. 
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Architecture  of  my  Kokkos  framework 

Designed  for  minimally-invasive  operation  alongside  large  Fortran  code. 


Frame 

//  Owns  and  allocates  TVProperties  object 
TVProperties*  tvproperties; 

//Controls  Kokkos  initialization/finalization 
void  initialize^..); 
void  finalize(...); 

TVProperties*  gettvpropertiesQ; 


Everything  is  controlled  from 
Fortran  through  a  single 
lightweight  global  Frame  object. 

Kernel  launches  and  data  comms 
are  referred  to  TVProperties* 
owned  by  Frame. 


TVProperties 

//  Owns  and  allocates  TVImpI  object 
TVImpI*  impl; 

//  Public  member  functions  to  communicate  data 

//  to/from  Views  in  TVImpI 

void  populatelnputStripe(...); 

void  populateOutputStripe(...); 

void  populateprEOSSharedData(...); 

void  populatechungSharedData(...); 


//  Public  member  functions  to  launch  collections  of 
//  kernels 

void  prEOSThermalProperties(...); 
void  prEOSViscousProperties(...); 
void  eosGammaJacobianProperties(...); 


-  B&EBBS3kSS8I  7 

DISTRIBUTION  A:  Approved  for  public  release;  distribution  unlimited.  PA  Clearance  Number  17207  “j  “j 


Architecture  of  my  Kokkos  framework 

Designed  for  minimally-invasive  operation  alongside  large  Fortran  code. 


Frame 

//  Owns  and  allocates  TVProperties  object 
TVProperties*  tvproperties; 

//Controls  Kokkos  initialization/finalization 
void  initialize^..); 
void  finalize(...); 

TVProperties*  gettvpropertiesQ; 


Everything  is  controlled  from 
Fortran  through  a  single 
lightweight  global  Frame  object. 

Kernel  launches  and  data  comms 
are  referred  to  TVProperties* 
owned  by  Frame. 


TVProperties 

//  Owns  and  allocates  TVImpI  object 
TVImpI*  impl; 

//  Public  member  functions  to  communicate  data 

//  to/from  Views  in  TVImpI 

void  populatelnputStripe(...); 

void  populateOutputStripe(...); 

void  populateprEOSSharedData(...); 

void  populatechungSharedData(...); 


//  Public  member  functions  to  launch  collections  of 
//  kernels 

void  prEOSThermalProperties(...); 
void  prEOSViscousProperties(...); 
void  eosGammaJacobianProperties(...); 


TVImpI 

//  Contains  members  of  TVProperties  that  don't  need 
//  external  visibility  (pimpl  idiom) 

//  Owns  and  allocates  Kokkos  Views 
ViewlDTypeT; 

ViewlDType  P; 

ViewlDType  Yi; 

...(several  dozen  of  these) 

//  Owns  std::unordered_maps  to  launch  kernels 
//  and  communicate  data  by  name 
unordered_map<string,ViewlDType> 
selectlDViewByName; 
unordered_map<string,View2DType> 
select2DViewByName; 

//  Owns  Launcher  for  each  kernel 
//  (lightweight  wrapper  with  string  identifier, 

//  inherits  common  timing  routines  from 
//  LauncherBase) 

unordered_map<string,LauncherBase*>  launchers; 
void  safeLaunch(...); 
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For  modularity  and  consistency:  one  subroutine->one  kernel 


Fortran  subroutine 

pure  subroutine  prEOSCalcSoundSpeed ( & 
rho,  rhop,  rhoT,  hp,  hT,  c) 
use  useGENKindDef s,  only:  dp 
implicit  none 

real (dp) ,  intent (in)  ::  rho,  rhop,  rhoT,  hp,  hT 
real (dp) ,  intent (out)  ::  c 

c  =  sqrt (  rho*ht/& 

(  rho*rhop*ht  +  rhot*  (  1 . 0_dp-rho*hp  )  )  ) 

end  subroutine  prEOSCalcSoundSpeed 


Operates  on  a  single  grid  point  at  a  time 


Kokkos  kernel  launch 

parallel_for (  tvimpl->nActivePoints , 
KOKKOS_LAMBDA (const  int&  t) 

{ 

c(t)  =  sqrt (  rho (t ) *hT  (t ) / 

(  rho (t) *rhoP (t) *hT (t) 

+  rhoT  (t ) *  (  1.0-rho (t) *hP (t)  )  )  ) 

}  ) ; 


Operates  on  nActivePoints  grid  points  in  parallel 

c,  rho,  hT,  etc.  are  Kokkos  Views, 
captured  by  value  from  members  of  TVImpl 
(View  copy  constructor  is  a  lightweight  shallow  copy) 

t  is  the  parallel  work  index 


There  are  roughly  50  of  these  that  serve  as  building  blocks. 
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GPU  Speedups  for  Standalone  Peng-Robinson 


Good  speedups  overall. 

GPU  speedup  is  better  for  fewer  species  (NS) 

■  smaller  per-thread  data  set  =>  improved 
cache  hit  rates  on  GPU 

■  Smaller  inner  loops  =>  vectorization  less 
efficient  on  CPU 

(a  combination  of  GPU  doing  better  and  CPU 
doing  a  bit  worse) 


*  Nvidia  Kepler  K40  **lntel  Xeon  E5-2620  v3  CPU 


GPU*  vs.  Serial  Fortran** 


■  5  species  10  20  H50 

32.6X 


Fortran  (Serial) 
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Integrating  Kokkos  with  CASTLES:  Interface  Functions 

C++  Interface  functions  (callable  from  Fortran)  tell  Frame  object  to  initialize/finalize  Kokkos,  launch  collections  of  kernels,  or  communicate  data. 


Decorated  with  'extern  "C"  to  disallow  name  mangling,  added  trailing  underscore_  expected  by  Fortran  linker. 


Interface  function  to  initialize  Kokkos  and  allocate  internal  storage 


extern 

"C"  void  frame 

_initialize_ (  int 

device_ 

id. 

int 

nGridPoints 

int 

ns 

int 

nq 

{ 

int 

iTurb  ) 

frame .initialize ( 

device_id,  / /  GPU 

device 

to  select 

nGridPoints , 

// 

Chunk  size  for 

Kokkos 

launches 

ns, 

// 

Num  chemical  species 

nq. 

// 

Utility  values 

} 

iTurb  ) ; 

Corresponding  Fortran  call 

!  Compute  KokkosDevicelD  as  MPI  rank%num  devices 
!  Num  devices  is  supplied  by  input  file 
call  f rame_initialize (  %VAL (  KokkosDevicelD, & 
%VAL (  KokkosMaxBlock  )  ,  & 

%VAL (  nspe  ) , & 

%VAL (  nq  )  ,  & 

%VAL (  iTurbType  )  ) 


Interface  function  to  launch  collection  of  kernels  for  thermal  and  viscous  properties 


extern  "C"  void 

f rame_tvpr ope rties_eos thermal andviscouspr ope rties_ ( 
int  nActivePoints  ) 

{ 

frame . gettvproperties ( ) ->eosThermalAndViscousProperties ( 
nActivePoints  ) ; 

} 


Corresponding  Fortran  call 


call 

f rame_tvpr ope rties_eos thermal andviscouspr ope rties& 
(%VAL (NumThisStripe) ) 
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Communicating  Data 

Data  communication  must  translate  between  4D  Fortran  pointers  (x,y,z,dataindx)  and  Kokkos  Views.  For  some  computations,  a  halo  of  fringe  points  must  be  ignored. 


1.  C++  framework  receives  double*  from  Fortran 

2.  Iterates  linearly  through  x,y,z  values,  copying  data  to  Views  and  skipping  fringe  points. 

3.  In  Views,  x,y,z  indices  are  flattened  into  a  single  parallel-work  index,  t. 

4.  After  computation,  reverse  the  process,  copying  data  from  Views  back  into  double*  storage  with  data  layout  expected  by  Fortran 

C++  framework  must  know  xdim,  ydim,  zdim,  and  fringe  boundaries  to  unpack  and  repack  data.  Annoying  indexing  math... 


Corresponding  Fortran  call 

!  Name  tag  of  destination  View 
tag  =  "Q"//char(0) 

call  f rame_castles_populateinputstripe (tag, & 
Q, &  !  4D  Fortran  pointer,  source  of  copy 

%VAL (NumX) ,  %VAL (NumY) ,  %VAL (NumZ) , & 

%VAL (SptX) ,  %VAL (EptX) , & 

%VAL (SptY)  ,  %VAL (EptY)  ,  & 

%VAL ( Spt Z )  ,  %VAL (EptZ )  ,  & 

%VAL (SptData) ,  %VAL (EptData) , & 

%VAL (SptStripe) ,  %VAL (EptStripe)  ) 


C++  interface  function 


extern  "C"  void  frame_ 

castl 

const  char  name [8], 

// 

double*  data. 

// 

int  nx,  int  ny,  int  nz 

,  // 

int  SptX,  int  EptX, 

// 

int  SptY,  int  EptY, 

// 

int  SptZ,  int  EptZ, 

// 

int  SptData, 

// 

int  EptData, 

// 

int  stripeStart, 

// 

int  stripeEnd) 

// 

// 

Dims  of  block  (including  fringes) 
Fringe  boundaries  in  x-direction 
"  y-direction 
"  z-direction 


Start  and  end  of  selected  x,y,z 
stripe;  used  when  looping  over  block 
in  chunks  (stripes)  of  fixed  size 


frame . gettvproperties ( ) ->populateInputStripe (name, 
data,  nx,  ny,  nz,  SptX,  EptX,  SptY,  EptY, 

SptZ,  EptZ,  SptData,  EptData,  stripeStart,  stripeEnd) ; 


Fortran  <->  C++  communication  works  as  follows: 
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Data  marshalling  challenges 


Challenge  #1: 

Kokkos  launches  need  enough  parallel  work  (enough  grid  points)  to  saturate  GPU. 


Solution: 

Ensure  availability  of  this  process'  entire  block  of  data  where  Kokkos  interface  functions  are  called. 
Restructuring  some  Fortran  calling  functions  was  required,  but  minimal  impact  to  code  overall. 


Challenge  #2: 

Block  size  handled  by  each  process  may  change  between  timesteps,  due  to  adaptive  mesh  refinement. 
Prefer  not  to  reallocate  Kokkos  data  structures,  or  worse,  exhaust  GPU  memory. 


Solution: 

Launch  Kokkos  computations  via  a  loop  over  this  process'  block  in  chunks  of  largeish  but  fixed  size 
"KokkosMaxBlock." 

KokkosMaxBlock  is  a  tuning  parameter  in  input  file,  large  enough  that  one  chunk's  launch  should  saturate  GPU  when 
10-20  processes  are  sharing  the  GPU  via  Nvidia  Multi-Process  Service. 

KokkosMaxBlock  =  8192  or  12288  usually  gives  good  performance. 
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Cluster-level  concerns:  Multiple  GPUs  per  node 


Kokkos  can  handle  multiple  GPUs. 


Standalone  Kokkos  application: 

Pass 

--kokkos-ndevices=2 

on  the  command  line  and  call 

Kokkos :: initialize (int&  argc,  char*  argv [ ] ) 

within  code. 

Kokkos  will  detect  available  GPUs  and  assign  MPI  ranks  to  GPUs 
round  robin. 

Minor  Caveat: 

If  MPI  process  is  bound  to  a  specific  set  of  cores,  Kokkos  does  not  try 
to  select  the  optimally  hardware  co-located  GPU 
(this  may  have  changed  since  last  I  checked). 


My  application  (embedded  deep  within  a  big  Fortran  code): 

Pass  number  of  available  GPU  devices  in  input  file. 

Manually  compute  which  device  to  use  as  (MPI  rank%num  devices) 

Tell  this  process'  Kokkos  kernels  to  use  that  device  as  follows: 

Kokkos :: InitArguments  args; 
args  .  device__id  =  device_id; 

Kokkos :: initialize (  args  ); 
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Cluster-level  concerns:  Nvidia  Multi-Process  Service  (MPS) 


Without  MPS: 

Each  MPI  process  has  its  own  CUDA  context. 
Multi-process  profile  shows  one  process  at  a  time  using  a  given  GPU. 


With  MPS: 

Multiple  processes  can  share  a  given  GPU  simultaneously. 


Runtime  API 

Driver  API 

Profiling  Overhead 
-  [OJ  Tesla  K40m 
-  Context  context  rank  0  (CUDA) 

7  Memcpy  (HtoD) 

/  MemCpy  (DtoH) 
ffl  Compute 

0  Streams 

Stream  7 

l  l  1  l  l  l 

I  1  1  M  1  1 

II  1  1  (■  1  1 

9  Context  context  rank  2  (CUDA)  1 

V  > 

1  *7  Memcpy  (HtoD) 
'  MemCpy  (DtoH) 
+  Compute 
-  Streams 
Stream  7 


Kernels  from  different  processes  do  not  overlap 


Kernels  from  different  processes  overlap 
For  small  NS,  turning  on  MPS  makes  overall 
application  up  to  3X  faster 


Better  utilization  and  dramatic  speedup  for  my  application,  and  easy  to  use 

(just  run  nvidia-cuda-mps -control  -d  on  each  compute  node  to  start  the  daemons). 


See  http://on-demand.gputechconf.com/gtc/2016/pre 
programming-mpi.pdf 
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GPU  Speedup  of  Overall  CASTLES+Kokkos 


Production-style  runs: 

40  MPI  ranks  on  2  nodes. 

■  CASTLES  Fortran  uses  20  CPUs/node 
only. 

■  CASTLES+Kokkos  uses  20  CPUs  +  2 
GPUs/node. 

■  Speedup  computed  as  (CASTLES 
Fortran  runtime)/(Castles+Kokkos 
runtime) 

2. 5-3. OX  consistently  observed  across 

range  of  desirable  problem  parameters. 
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Kokkos  on  CPU  matches  Fortran  on  CPU 


Can  the  Kokkos-enabled  codebase  compile  for  CPU  as  well  as  GPU,  with  good  performance? 


Test  case:  NS=5, 163  grid  points,  50  timesteps 


30.2  s 

28.5  s 

CASTLES+Fortran  (1  CPU)  CASTLES+Kokkos  OpenMP 

(1  thread)** 


**KMP_AFFINITY=compact,l,granularity=core 


Often,  naively  porting  Fortran  to  C++  can  result  in  a  slowdown  (e.g.  compiler  has  a  harder  time  optimizing/vectorizing  loops). 
Need  to  use  hardware-specific  (Intel)  compiler  and  manually  tweak  vector  pragmas  for  some  in-kernel  loops,  but  in  the  end 

Kokkos  C++  is  as  fast  as  original  Fortran. 


To  compile  for  CPU,  just  change  arguments  to  makefile. 

nvcc  ignores  Intel  pragmas.  Kokkos-enabled  source  code  is  (almost  entirely)  same  as  used  for  GPU. 

Only  two  kernels  needed  moderately  divergent  code  for  good  performance  on  both  CPU  and  GPU. 
Kokkos  build  system  provides  pragmas  to  select  different  code  when  compiling  for  different  hardware: 


KOKKOS_LAMBDA  (  const  intS  t  ) 

{ 

#ifdef  KOKKOS_HAVE _ CUD A 

...GPU-optimal  code  goes  here... 
#else 

...CPU-optimal  code  goes  here... 
#endif 

} 


Kokkos  promise  of  "performant  cross-platform  parallelism"  more  or  less  fulfilled. 
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Node  level  performance  +  comparison  with  Xeon  Phi 

Kokkos  runs  on  Xeon  Phis  in  native  mode: 

■  MPI+Kokkos  processes  see  Phi  cores  as  additional  CPU  cores. 

■  Kokkos  computations  are  not  offloaded  GPU-style. 

■  Entire  process  runs  on  a  set  of  Phi  cores  just  like  on  a  multicore  CPU. 

GPUs  are  offload  coprocessors,  so  can't  compare  Phi  vs.  GPU  apples-to-apples.  But  we  can  get  an  idea  at  node  level. 


Runtime  for  fixed  problem  size  1203,  NS=5, 1st  order,  20  timesteps 


726  s 


CASTLES  Fortran:  CASTLES+Kokkos:  CASTLES+Kokkos:  CASTLES+Kokkos: 
20  CPU  cores  20  CPU  cores  +  2  Xeon  Phi  Knight's  Xeon  Phi  Knight's 

GPUs  Corner  Landing 
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System  details 


2x10  core  Intel  Xeon  E5-2650v3 
Config  file  for  Intel  MPI: 

-genv  I_MPI_PIN_DOMAI8=auto : compact 
-n  20  . /CASTLES .kokkos 

Although  cores  are  hyperthreaded  (40  logical  cores  available), 
adding  more  processes  does  not  improve  performance  noticeably. 


2x10  core  Intel  Xeon  E5-2650  v3 
+  2  Kepler  K40  GPUs. 

Same  MPI  config  as  CASTLES  Fortran. 


One  Knight's  Corner  5110P 
(60  cores,  240  logical  processors). 
Config  file  for  Intel  MPI: 


-genv  I_MPI_PIN_DOMAIN=4 : compact  -genv  OMP_NUM_THREAD S  4 


-host  micO  -n  60  -env  KMP_AFFINITY=compact, granularity=core  . /CASTLES . knc 


One  Knight's  Landing  7230  (64  cores,  256  logical  processors),  using  SNC4  clustering 
Config  file  for  Intel  MPI: 

-genv  I _MP I_PIN_DOMAIN=l :  compact  -genv  OMP_NUM_THREADS  1 

-n  256  -env  KMP_AFFINITY=compact , granular ity=core  numactl  -m  4, 5, 6, 7  , /CASTLES .knl 

Numactl  -m  4, 5, 6, 7  encourages  first-touch  allocation  in  onboard  high-bandwidth  memory. 

I  experimented  with  fewer  MPI  processes,  bigger  domains,  and  more  OpenMP  threads, 
and  found  256  procs  with  1  thread/proc  best. 
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Bandwidth  Optimizations  for  Per-Grid-Point  Inner  Loops 

P-R  and  Chung  involve  nested  inner  loops  over  chemical  species  NS  (can  be  50  or  more). 
Independent  calculations  for  each  grid  point. 

Toy  example  (shown  as  CPU-style  serial  loop  over  grid  points): 

//  Loop  over  N  grid  points  (trivially  parallel) 
f or (  int  t  =  0;  t  <  N;  t++  ) 

/ /  ax [ ] ,  ay [ ] ,  bx [ ] ,  and  by [ ] : 

//  arrays  of  size  NS*N  that  store  per-grid-point  input  data, 
f or (  int  y  =  0;  y  <  NS;  y++  )  //  NS  ~  up  to  50ish 
f or (  int  x  =  0;  x  <  NS;  x++  ) 

output [N* (NS*y+x) +t ]  =  ax [x*N+t ] *ay [y*N+t ]  +  bx [x*N+t] *by [y*N+t] ; 


"Embarrassingly  parallel,"  and  inner  loops  are  simple... 
but  achieving  high  performance  is  an  interesting  problem! 


DISTRIBUTION  A:  Approved  for  public  release;  distribution  unlimited.  PA  Clearance  Number  17207 


24 


Bandwidth  Optimizations  for  Per-Grid-Point  Inner  Loops 

P-R  and  Chung  involve  nested  inner  loops  over  chemical  species  NS  (can  be  50  or  more) 
Independent  calculations  for  each  grid  point. 


Toy  example  (shown  as  CPU-style  serial  loop  over  grid  points): 


Several  X-dependent  loads  Several  Y-dependent  loads 


"Embarrassingly  parallel,"  and  inner  loops  are  simple... 
but  achieving  high  performance  is  an  interesting  problem! 
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Testing  Parameters 


Tesla  K40  GPU 

■  12  GB  device  memory 

■  15  Kepler  SMs 

Kepler  architecture: 

■  192  single-precision  cores  and  64  double-precision  cores  per  SM 

■  100%  occupancy  =  2048  active  threads  per  SM 

■  65,536  registers  available  per  SM 

■  64KB  LI  cache/shared  memory  per  SM,  configurable  as  either  48  KB  LI  +  16  KB  shared,  32  KB  LI  +  32  KB  shared, 
or  16  KB  LI +  32  KB  shared 

■  48  KB  read-only  cache  (declare  pointers  with  const _ restrict _ to  use  this**) 

Compiled  with  nvcc  version  7.5,  opt-in  LI  caching,  verbose  to  see  register/local  mem  use,  targeting  compute  capability  3.5 

nvcc  -Xptxas="-dlcm=ca"  -Xptxas="-v"  -arch=sm_35  kernels. cu 

Runtime  call  to  cudaDeviceSetCacheConf  ig  ( cudaFuncCachePref erLl )  to  set  the  48  KB  LI  +  16  KB  shared 
option  in  case  the  compiler  chooses  to  load  via  LI 

For  timing  purposes,  I  use  N=2048*120,  NS=64,  960  blocks,  256  threads/block.  On  a  K40  with  15  SMs,  this  is  8  full  waves. 
Kernel  wall  times  averaged  over  10  trials. 

**ln  subsequent  examples,  I  do  not  write  "const."  Although  the  is  pretty  adamant  that  writing  "const"  is  necessary  to  trigger 

loads  via  the  48  KB  read-only  cache,  I  found  that  for  toy  kernels  presented  here,  the  compiler  uses  read-only  cache  even  if  "const"  is  omitted. 
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Naive  Cuda  Kernel  -  one  thread  per  grid  point 


{ 


.global _  void  naive (  double*  _ restrict _  ax,  double*  _ restrict _  bx, 

double*  _ restrict _  ay,  double*  _ restrict _  by,  double*  _ 

//  Ordinarily  we  might  wrap  this  in  a  grid  stride  loop. ..omitted  to  save  space, 
int  t  =  threadldx.x  +  blockldx . x*blockDim . x; 

#pragma  unroll  1  //  Disallow  compiler  unrolling  so  we  know  what's  happening.** 
for (  int  y  =  0;  y  <  NS;  y++  ) 

#  pragma  unroll  1 

for (  int  x  =  0;  x  <  NS;  x++  ) 

output [N* (NS*y+x) +t ]  =  ax [x*N+t] *ay [y*N+t]  +  bx [x*N+t ] *by [y*N+t ] ; 


restrict. 


**lf  we  omit  the  "  "s  and  let  the  compiler  unroll  as  it  wishes,  register  use  goes  up  (as  expected),  occupancy  falls,  and  the  "naive"  kernel's  performance 

worsens.  100%  occupancy  is  not  always  essential,  but  in  this  case,  explicitly  including  the  pragmas  is  better  than  relying  on  compiler  heuristics. 


output  ) 
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Naive  Cuda  Kernel  -  one  thread  per  grid  point 


_ global _  void  naive (  double*  _ restrict _  ax,  double*  _ restrict _  bx, 

double*  _ restrict _  ay,  double*  _ restrict _  by,  double* 


_ restrict _  output  ) 


//  Ordinarily  we  might  wrap  this  in  a  grid  stride  loop. ..omitted  to  save  space, 
int  t  =  threadldx.x  +  blockldx . x*blockDim . x; 

#pragma  unroll  1  //  Disallow  compiler  unrolling  so  we  know  what's  happening, 
for (  int  y  =  0;  y  <  NS;  y++  ) 

#  pragma  unroll  1 

for (  int  x  =  0;  x  <  NS;  x++  ) 

output [N* (NS*y+x) +t ]  =  ax [x*N+t] *ay [y*N+t]  +  bx [x*N+t ] *by [y*N+t ] ; 


Grid  point  index  "t"  is  the  fast  index  for  coalescing 
(corresponds  to  Kokkos  :  :  LayoutLeft) 
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Naive  Cuda  Kernel  -  one  thread  per  grid  point 

_ global _  void  naive (  double*  _ restrict _  ax,  double*  _ restrict _  bx, 

double*  _ restrict _  ay,  double*  _ restrict _  by,  double*  _ restrict _  output  ) 

{ 

//  Ordinarily  we  might  wrap  this  in  a  grid  stride  loop. ..omitted  to  save  space, 
int  t  =  threadldx.x  +  blockldx . x*blockDim . x; 

#pragma  unroll  1  //  Disallow  compiler  unrolling  so  we  know  what's  happening, 
for (  int  y  =  0;  y  <  NS;  y++  ) 

#  pragma  unroll  1 

for (  int  x  =  0;  x  <  NS;  x++  ) 

output [N* (NS*y+x) +t ]  =  ax [x*N+t] *  ay [y*N+t]  +  bx [x*N+t] Hby [y*N+t] ; 


Grid  point  index  "t"  is  the  fast  index  for  coalescing 
(corresponds  to  Kokkos  :  :  LayoutLeft) 


y-dependent  loads  should  hit  in  cache  (or  be  promoted  to  registers)  during  loop  over  x. 
I  find  that  manually  hoisting  y-loads  to  a  register  does  not  affect  performance. 
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Naive  Cuda  Kernel  -  one  thread  per  grid  point 

_ global _  void  naive (  double*  _ restrict _  ax,  double*  _ restrict _  bx, 

double*  _ restrict _  ay,  double*  _ restrict _  by,  double*  _ restrict _  output  ) 

{ 

//  Ordinarily  we  might  wrap  this  in  a  grid  stride  loop. ..omitted  to  save  space, 
int  t  =  threadldx.x  +  blockldx . x*blockDim . x; 

#pragma  unroll  1  //  Disallow  compiler  unrolling  so  we  know  what's  happening, 
for (  int  y  =  0;  y  <  NS;  y++  ) 

#  pragma  unroll  1 

for (  int  x  =  0;  x  <  NS;  x++  ) 

output  [N*  (NS*y+x) +t  ]  =  ax  [x*N+t]  *  ay  [y*N+t]  +  bx[x*N+t]  ,1|by[y*N+t]  ; 


Grid  point  index  "t"  is  the  fast  index  for  coalescing 
(corresponds  to  Kokkos  :  :  LayoutLeft) 


y-dependent  loads  should  hit  in  cache  (or  be  promoted  to  registers)  during  loop  over  x. 
I  find  that  manually  hoisting  y-loads  to  a  register  does  not  affect  performance. 


Each  x-load  is  used  only  once  per  outer  y-loop  iteration. 
Probably  won't  hit  in  cache  on  the  next  outer  y-loop  iteration. 
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Naive  Cuda  Kernel  -  one  thread  per  grid  point 

_ global _  void  naive (  double*  _ restrict _  ax,  double*  _ restrict _  bx, 

double*  _ restrict _  ay,  double*  _ restrict _  by,  double*  _ restrict _  output  ) 

{ 

//  Ordinarily  we  might  wrap  this  in  a  grid  stride  loop. ..omitted  to  save  space, 
int  t  =  threadldx.x  +  blockldx . x*blockDim . x; 

#pragma  unroll  1  //  Disallow  compiler  unrolling  so  we  know  what's  happening.** 
for (  int  y  =  0;  y  <  NS;  y++  ) 

#  pragma  unroll  1 

for (  int  x  =  0;  x  <  NS;  x++  ) 

output  [N*  (NS*y+x) +t  ]  =  ax  [x*N+t]  *  ay  [y*N+t]  +  bx[x*N+t]  ’lby[y*N+t]  ; 


Grid  point  index  "t"  is  the  fast  index  for  coalescing 
(corresponds  to  Kokkos  :  :  LayoutLeft) 


y-dependent  loads  should  hit  in  cache  (or  be  promoted  to  registers)  during  loop  over  x. 
I  find  that  manually  hoisting  y-loads  to  a  register  does  not  affect  performance. 


Each  x-load  is  used  only  once  per  outer  y-loop  iteration. 
Probably  won't  hit  in  cache  on  the  next  outer  y-loop  iteration. 


Kernel  "naive"  is  strongly  bandwidth-bound,  and 
accesses  are  already  coalesced.  What  should  we  do? 
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Standard  CPU-informed  strategy:  tile  the  loop? 


Recall  why  loop  tiling  helps  on  CPU: 


for (  int  yy  =  0;  yy  <  NS;  yy  +=  TILE_FACTOR  ) 
for (  int  x  =  0;  x  <  NS;  x++  ) 

for (  int  y  =  yy;  y  <  yy  +  TILE_FACTOR;  y++  ) 

output [N* (NS*y+x) +t ]  =  ax [x*N+t] *ay [y*N+t]  +  bx [x*N+t] *by [y*N+t] ; 
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Standard  CPU-informed  strategy:  tile  the  loop? 


Recall  why  loop  tiling  helps  on  CPU: 


for  (  int  yy  =  0 ;  yy  <  NS;  yy  +=  TILE_FACTOR  ) 
for (  int  x  =  0;  x  <  NS;  x++  ) 

for (  int  y  =  yy;  y  <  yy  +  TILE_FACTOR;  y++  ) 

output [N* (NS*y+x) +t ]  =  ax [x*N+t] *ay [y*N+t]  +  bx [x*N+t] *by [y*N+t] ; 


X-dependent  loads  should  hit  in  cache 
for  the  inner  y-loop,  and  be  reused 
TILE  FACTOR  times 
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Standard  CPU-informed  strategy:  tile  the  loop? 


Recall  why  loop  tiling  helps  on  CPU: 


X-dependent  loads  should  hit  in  cache  Each  x-iteration  now  treats  TILE_FACTOR 

for  the  inner  y-loop,  and  be  reused  y-iterations  instead  of  just  one. 

TILE  FACTOR  times 


TILE_FACTOR  y-dependent  loads  should 
hit  in  cache  on  each  iteration  of  the  x-loop 


(in  fact,  for  a  typical  CPU  cache  and  modest  values  of  NS  like  64,  the  entire  working  set  should  easily  fit 
in  cache,  and  it's  not  necessary  to  tile  the  loop  at  all.) 

Pretty  standard  stuff.. .but  do  we  expect  this  to  work  on  a  Kepler  GPU? 
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Loop  tiling  on  GPU 


.global _  void  tiled  (...same  args  as  naive...) 


} 


int  t  =  threadldx.x  +  blockldx . x*blockDim . x; 
for (  int  yy  =  0;  yy  <  NS;  yy  +=  TILE_FACTOR  ) 
for (  int  x  =  0;  x  <  NS;  x++  ) 

for (  int  y  =  yy;  y  <  yy  +  TILE_FACTOR;  y++  ) 
output [N* (NS*y+x) +t ]  =  ax [x*N+t ] *ay [y*N+t ] 

+  bx [x*N+t ] *by [y*N+t ] ; 
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Loop  tiling  on  GPU 


_ global _  void  tiled  (...same  args  as  naive...) 

{ 

int  t  =  threadldx.x  +  blockldx . x*blockDim . x; 
for (  int  yy  =  0;  yy  <  NS;  yy  +=  TILE_FACTOR  ) 
for (  int  x  =  0;  x  <  NS;  x++  ) 

for (  int  y  =  yy;  y  <  yy  +  TILE_FACTOR;  y++  ) 
output [N* (NS*y+x) +t ]  =  ax [x*N+t ] *ay [y*N+t ] 

+  bx [x*N+t ] *by [y*N+t ] ; 


Tiling  is  worse  than  naive.  Cache  per  grid  point 
(thread)  is  just  too  small. 

Read-only  cache  and  LI  cache  are  only  48  KB  each. 

Whichever  compiler  chooses  to  use: 

100%  occupancy  =  2048  threads 

48  KB/2048  threads  =  only  3  doubles'  worth  of  cache  per  thread. 
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Loop  tiling  on  GPU 


.global _  void  tiled  (...same  args  as  naive...) 


} 


int  t  =  threadldx.x  +  blockldx . x*blockDim . x; 
for (  int  yy  =  0;  yy  <  NS;  yy  +=  TILE_FACTOR  ) 
for (  int  x  =  0;  x  <  NS;  x++  ) 

for (  int  y  =  yy;  y  <  yy  +  TILE_FACTOR;  y++  ) 
output [N* (NS*y+x) +t ]  =  ax [x*N+t ] *ay [y*N+t ] 

+  bx [x*N+t ] *by [y*N+t ] ; 


Tiled  Loop  Runtimes 


,234  s 

23U 

S 

,209  s 

,135  s 

Naive  TILE  FACTOR  2  4  8 


Tiling  is  worse  than  naive.  Cache  per  grid  point 
(thread)  is  just  too  small. 

Read-only  cache  and  LI  cache  are  only  48  KB  each. 

Whichever  compiler  chooses  to  use: 

100%  occupancy  =  2048  threads 

48  KB/2048  threads  =  only  3  doubles'  worth  of  cache  per  thread. 

nvprof  confirms  poor  hit  rates  (results  for  TILE_FACTOR  2  shown):** 

nvprof  — kernels  ::tiled:l  -metrics  \ 
nc_cache_global_hit_rate, tex_cache_hit_rate  . /a . out 
.  .  .  Min  Max 

.  .  .  Non-Coherent  Global  Hit  Rate  0.85%  0.85% 

.  .  .  Texture  Cache  Hit  Rate  0.65%  0.65% 


**As  mentioned  previously,  the  compiler  appears  to  use  read-only/texture  cache  for  loads. 

I'm  not  sure  why  there  are  separate  metrics  to  describe  "read-only  cache  accesses"  and  "texture  cache  accesses"  (its  the  same  hardware).  Perhaps  some  Cuda  ninja  can  explain? 
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Tile  with  reduced  occupancy 

100%  occupancy  is  not  a  strict  requirement  for  peak  performance. 

Lower  occupancy  =  more  cache  per  grid  point.** 

Manually  suppress  occupancy  by  giving  each  block  "dummy"  shared  memory. 

For  example:  16  KB  shared  memory  is  available  on  each  SM. 

If  we  assign  each  block  4096  B  smem,  only  4  blocks  can  fit  on  each  SM. 

4*256  =  1024  threads.  1024/2048  =  50%  occupancy. 


_ global _  void  tiled_reduced_occupancy  (...) 

{ 

.extern  _ shared _  int  smem [ ] ; 

int  t  =  threadldx.x  +  blockldx . x*blockDim . x; 
for (  int  yy  =  0;  yy  <  NS;  yy  +=  TILE_FACTOR  ) 
for (  int  x  =  0;  x  <  NS;  x++  ) 

for  (  int  y  =  yy;  y  <  yy  +  TILE_FACTOR;  y++  ) 
output [N* (NS*y+x) +t ]  =  ax [x*N+t] *ay [y*N+t] 

+  bx [x*N+t] *by [y*N+t] ; 


“See  "GPU  Memory  Bootcamp  II:  Beyond  Best  Practices"  from  GTC  2015  (  http://on-demand.gputechconf.com/gtc/2015/presentation/S5376-Tonv-Scudiero.pdf  ) 

for  a  more  detailed  discussion  of  occupancy  vs.  hit  rate. 
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Tile  with  reduced  occupancy 

100%  occupancy  is  not  a  strict  requirement  for  peak  performance. 

Lower  occupancy  =  more  cache  per  grid  point. 

Manually  suppress  occupancy  by  giving  each  block  "dummy"  shared  memory. 

For  example:  16  KB  shared  memory  is  available  on  each  SM. 

If  we  assign  each  block  4096  B  smem,  only  4  blocks  can  fit  on  each  SM. 

4*256  =  1024  threads.  1024/2048  =  50%  occupancy. 


.global _  void  tiled_reduced_occupancy  (...) 

—.extern  _ shared _  int  smem  [  ]  ; 

, — ▼ 

int  t  =  threadldx.x  +  blockldx . x*blockDim . x; 
for  (  int  yy  =  0;  yy  <  NS;  yy  +=  TILE_FACTOR  ) 
for (  int  x  =  0;  x  <  NS;  x++  ) 

for  (  int  y  =  yy;  y  <  yy  +  TILE_FACTOR;  y++  ) 
output [N* (NS*y+x) +t ]  =  ax [x*N+t] *ay [y*N+t] 

+  bx [x*N+t] *by [y*N+t] ; 


Mostly  worse  than  naive. 
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Tile  with  reduced  occupancy 

100%  occupancy  is  not  a  strict  requirement  for  peak  performance. 

Lower  occupancy  =  more  cache  per  grid  point. 

Manually  suppress  occupancy  by  giving  each  block  "dummy"  shared  memory. 

For  example:  16  KB  shared  memory  is  available  on  each  SM. 

If  we  assign  each  block  4096  B  smem,  only  4  blocks  can  fit  on  each  SM. 

4*256  =  1024  threads.  1024/2048  =  50%  occupancy. 


.global _  void  tiled_reduced_occupancy  (...) 

—.extern  _ shared _  int  smem  [  ]  ; 

, — ▼ 

int  t  =  threadldx.x  +  blockldx . x*blockDim . x; 
for  (  int  yy  =  0;  yy  <  NS;  yy  +=  TILE_FACTOR  ) 
for (  int  x  =  0;  x  <  NS;  x++  ) 

for  (  int  y  =  yy;  y  <  yy  +  TILE_FACTOR;  y++  ) 
output [N* (NS*y+x) +t ]  =  ax [x*N+t] *ay [y*N+t] 

+  bx [x*N+t] *by [y*N+t] ; 


Mostly  worse  than  naive.  Sweet  spot  at  TILE_FACTOR  4, 12.5%  occupancy  can  be  explained  by  cache  hits: 

nvprof  — kernels  :  :  tiled_reduced_occupancy :  4  — metrics  achieved_occupancy ,  nc_cache_global_hit_rate,  tex_cache_hit_rate  ./a. out 

.  .  .  Achieved  Occupancy  0.124771  0.124771 

.  .  .  Non-Coherent  Global  Hit  Rate  75.81%  75.81% 
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Tile  using  both  LI  and  read-only  cache 

On  Kepler,  48  KB  read-only  cache  and  64  KB  Ll+shared  cache  are  independent.  Use  both! 


Tile  using  thread-local  arrays  : 

(placed  in  a  local  memory  stack  frame.  Allocated  in  device  global  memory,  but  cached  in  LI)** 


_ global _  void  tiled_local_arrays  (...) 

{ 

double  ay_local [TILE_FACTOR] ;  //  Thread-local  arrays 
double  by_local [TILE_FACTOR] ;  //  (placed  in  local  memory) 
int  t  =  threadldx.x  +  blockldx . x*blockDim . x; 
for (  int  yy  =0;  yy  <  NS;  yy  +=  TILE_FACTOR  ) 

{ 

for  (  int  y  =  yy;  y  <  yy  +  TILE_FACTOR;  y++  ) 

{ 

ay_local [y-yy]  =  ay[y*N+t]; 
by_local [y-yy]  =  by[y*N+t]; 


int  x  =  0;  x  <  NS;  x++  ) 
for (  int  y  =  yy;  y  <  yy  +  TILE_FACTOR;  y++  ) 

output [N* (NS*y+x) +t ]  =  ax [x*N+t] *ay_local [y-yy] 

+  bx [x*N+t ] *by_local [y-yy ] 


**On  Kepler,  local  loads  are  cached  in  LI.  On  Maxwell,  Ll/tex  is  a  single  unified  cache,  and  local  loads  are  cached  in  L2  only.  Therefore,  I  expect  tiling  with 
local  memory  to  be  helpful  on  Kepler  only.  Maxwell  has  separate  hardware  for  shared  memory,  so  you  could  try  using  thread-local  smem  arrays  instead. 

See  for  an  in-depth  discussion  of  where  the  compiler  places  thread-local  arrays. 

See  http://docs.nvidia.eom/cuda/kepler-tuning-guide/#ll-cache  and  http://docs.nvidia.eom/cuda/maxwell-tuning-guide/#ll-cache  for  microarchitecture  details. 
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Tile  using  both  LI  and  read-only  cache 

On  Kepler,  48  KB  read-only  cache  and  64  KB  Ll+shared  cache  are  independent.  Use  both! 


Tile  using  thread-local  arrays  : 

(placed  in  a  local  memory  stack  frame.  Allocated  in  device  global  memory,  but  cached  in  LI) 


_ global _  void  tiled_local_arrays  (...) 

{ 

double  ay_local [TILE_FACTOR] ;  //  Thread-local  arrays 
double  by_local [TILE_FACTOR] ;  //  (placed  in  local  memory) 
int  t  =  threadldx.x  +  blockldx . x*blockDim . x; 
for (  int  yy  =0;  yy  <  NS;  yy  +=  TILE_FACTOR  ) 


for (  int  y  =  yy;  y  <  yy  +  TILE_FACTOR;  y++  ) 


{ 

=  ay  [y*N+t  ]  ;  Thread-local  arrays  for 
=  by  [y*N+t  ]  ;  Y-dependent  loads  (cached  in  LI) 

} 


ay_local [y-yy] 
by_local [y-yy] 


for (  int  x  =  0;  x  <  NS;  x++  ) 


int  y  =  yy;  y  <  yy  +  TILE_FACTOR;  y++  ) 
output [N* (NS*y+x) +t ]  =  ax[x*N+t] 

+  bx [x*N+t ] 


*ay_local [y-yy] 
*by_local [y-yy] 
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Tile  using  both  LI  and  read-only  cache 

On  Kepler,  48  KB  read-only  cache  and  64  KB  Ll+shared  cache  are  independent.  Use  both! 
Tile  using  thread-local  arrays  : 

(placed  in  a  local  memory  stack  frame.  Allocated  in  device  global  memory,  but  cached  in  LI) 

_ global _  void  tiled_local_arrays  (...) 

{ 

double  ay_local [TILE_FACTOR] ;  //  Thread-local  arrays 
double  by_local [TILE_FACTOR] ;  //  (placed  in  local  memory) 
int  t  =  threadldx.x  +  blockldx . x*blockDim . x; 
for (  int  yy  =0;  yy  <  NS;  yy  +=  TILE_FACTOR  ) 

{ 

for  (  int  y  =  yy;  y  <  yy  +  TILE_FACTOR;  y++  ) 

{ 

ay_local [y-yy] 
by_local [y-yy] 

} 

for (  int  x  =  0;  x  <  NS;  x++  ) 

for (  int  y  =  yy;  y  <  yy  +  TILE_FACTOR;  y++  ) 
output [N* (NS*y+x) +t ]  = 

+ 

'  X-dependent  loads  cached  in  read-only 


=  ay  [y*N+t  ]  ;  Thread-local  arrays  for 
=  by  [y*N+t  ]  ;  Y-dependent  loads  (cached  in  LI) 
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Tile  using  both  LI  and  read-only  cache 

On  Kepler,  48  KB  read-only  cache  and  64  KB  Ll+shared  cache  are  independent.  Use  both! 
Tile  using  thread-local  arrays  : 

(placed  in  a  local  memory  stack  frame.  Allocated  in  device  global  memory,  but  cached  in  LI) 

_ global _  void  tiled_local_arrays  (...) 

{ 

double  ay_local [TILE_FACTOR] ;  //  Thread-local  arrays 
double  by_local [TILE_FACTOR] ;  //  (placed  in  local  memory) 
int  t  =  threadldx.x  +  blockldx . x*blockDim . x; 
for (  int  yy  =0;  yy  <  NS;  yy  +=  TILE_FACTOR  ) 

{ 

for  (  int  y  =  yy;  y  <  yy  +  TILE_FACTOR;  y++  ) 

{ 

ay_local [y-yy] 
by_local [y-yy] 

} 

for (  int  x  =  0;  x  <  NS;  x++  ) 

for (  int  y  =  yy;  y  <  yy  +  TILE_FACTOR;  y++  ) 
output [N* (NS*y+x) +t ]  = 

+ 

'  X-dependent  loads  cached  in  read-only 


Fast  forTILE_FACTOR  =  2!  LI  cache  fields  all  y-dependent  loads  (100%  hit  rate) 
Slower  for  TILE  FACTOR  =  4  and  8.  Hit  rate  decreases. 


=  ay  [y*N+t  ]  ;  Thread-local  arrays  for 
=  by  [y*N+t  ]  ;  Y-dependent  loads  (cached  in  LI) 
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Tile  with  explicit  register  use  ("unroll-and-jam") 

Kepler  SM  has  65,536  4B  registers  =  262  KB  of  near-core  memory  available  as  registers. 
>2.5X  more  than  read-only  and  LI  caches  combined. 

_ global _  void  unroll_and_jam_by2_registers  (...) 

{ 

//  Encourage  these  to  be  placed  in  registers 
double  ay_localO,  by_localO,  ay_locall,  by_locall; 
int  t  =  threadldx.x  +  blockldx . x*blockDim. x; 

#pragma  unroll  1 

for (  int  yy  =  0;  yy  <  NS;  yy  +=  2  ) 

{  ay_local0  =  ay [ (yy+0 ) *N+t ] ; 
by_local0  =  by [ (yy+0 ) *N+t ] ; 
ay_locall  =  ay [ (yy+1 ) *N+t ] ; 
by_locall  =  by [ (yy+1 ) *N+t ] ; 

#  pragma  unroll  1 

for (  int  x  =  0;  x  <  NS;  x++  ) 

{ 

output [N* (NS* (yy+0) +x) +t]  =  ax [x*N+t ] *ay_local0 

+  bx [x*N+t] *by_local0; 

output [N* (NS* (yy+1 ) +x) +t ]  =  ax [x*N+t ] *ay_locall 

+  bx [x*N+t] *by_locall; 

} 

} 
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Tile  with  explicit  register  use  ("unroll-and-jam") 

Kepler  SM  has  65,536  4B  registers  =  262  KB  of  near-core  memory  available  as  registers. 
>2.5X  more  than  read-only  and  LI  caches  combined. 

_ global _  void  unroll_and_jam_by2_registers  (...) 

{ 

//  Encourage  these  to  be  placed  in  registers 
double  ay_localO,  by_localO,  ay_locall,  by_locall; 
int  t  =  threadldx.x  +  blockldx . x*blockDim. x; 

#pragma  unroll  1 

for (  int  yy  =  0;  yy  <  NS;  yy  +=  2  ) 

{  ay_local0  =  ay [ (yy+0 ) *N+t ] ; 

by_local0  =  by [ (yy+0 ) *N+t ] ;  Y-dependent  loads  reused 
ay_locall  =  ay [ (yy+1 ) *N+t ] ;  x  times  in  x-loop 
by_locall  =  by [ (yy+1 ) *N+t ] ; 

#  pragma  unroll  1 
for (  int  x  =  0;  x  <  NS;  x++  ) 

{ 

output [N* (NS* (yy+0) +x) +t]  =  ax[x*N+t] 

+  bx[x*N+t] 

output [N* (NS* (yy+1 ) +x) +t ]  =  ax[x*N+t] 

+  bx[x*N+t] 

} 

} 
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Tile  with  explicit  register  use  ("unroll-and-jam") 

Kepler  SM  has  65,536  4B  registers  =  262  KB  of  near-core  memory  available  as  registers. 
>2.5X  more  than  read-only  and  LI  caches  combined. 

_ global _  void  unroll_and_jam_by2_registers  (...) 

{ 

//  Encourage  these  to  be  placed  in  registers 
double  ay_localO,  by_localO,  ay_locall,  by_locall; 
int  t  =  threadldx.x  +  blockldx . x*blockDim. x; 

#pragma  unroll  1 

for (  int  yy  =  0;  yy  <  NS;  yy  +=  2  ) 

{  ay_local0  =  ay [ (yy+0 ) *N+t ] ; 

by_local0  =  by [ (yy+0 ) *N+t ] ;  Y-dependent  loads  reused 
ay_locall  =  ay [ (yy+1 ) *N+t ] ;  x  times  in  x-loop 
by_locall  =  by [ (yy+1 ) *N+t ] ; 

#  pragma  unroll  1 

for (  int  x  =  0;  x  <  NS;  x++  ) 

{  _  _ 

output [N* (NS* (yy+0) +x) +t]  =  ax [x*N+t] *ay_local0 

+  bx [x*N+t] *by_local0 : 

output [N* (NS* (yy+1 ) +x) +t ]  =  ax [x*N+t ] * ay_locall 

+  bx [x*N+t] *by_locall : 

^  X-dependent  loads  used  twice 


DISTRIBUTION  A:  Approved  for  public  release;  distribution  unlimited.  PA  Clearance  Number  17207 


47 


Tile  with  explicit  register  use  ("unroll-and-jam") 

Kepler  SM  has  65,536  4B  registers  =  262  KB  of  near-core  memory  available  as  registers. 
>2.5X  more  than  read-only  and  LI  caches  combined. 

_ global _  void  unroll_and_jam_by2_registers  (...) 

{ 

//  Encourage  these  to  be  placed  in  registers 
double  ay_localO,  by_localO,  ay_locall,  by_locall; 
int  t  =  threadldx.x  +  blockldx . x*blockDim. x; 

#pragma  unroll  1 

for (  int  yy  =  0;  yy  <  NS;  yy  +=  2  ) 

{  ay_local0  =  ay [ (yy+0 ) *N+t ] ; 

by_local0  =  by [ (yy+0 ) *N+t ] ;  Y-dependent  loads  reused 
ay_locall  =  ay [ (yy+1 ) *N+t ] ;  x  times  in  x-loop 
by_locall  =  by [ (yy+1 ) *N+t ] ; 

#  pragma  unroll  1 

for (  int  x  =  0;  x  <  NS;  x++  ) 

{  _  _ 

output [N* (NS* (yy+0) +x) +t]  =  ax [x*N+t] *ay_local0 

+  bx [x*N+t] *by_local0 : 

output [N* (NS* (yy+1 ) +x) +t ]  =  ax [x*N+t ] * ay_locall 

+  bx [x*N+t] *by_locall : 

^  X-dependent  loads  used  twice 


In  practice  I  like  this  approach. 


At  50%  occupancy  you  can  use  up  to  64  registers  (32  DP  values)  for 
tiling.  Unrolling  by  2  or  4  is  not  too  annoying  for  a  few 
performance-limiting  kernels. 


...but  don't  do  it  for  all  your  kernels. 
"Premature  optimization  is  the  root  of  all  evil" 
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Cooperative  pattern 


Each  grid  point  handled  by  a  ■single  thread  a  warp. 

_ global _  void  warp_team  (...) 

{ 

int  warpid  =  (  threadldx.x  +  blockldx . x*blockDim. x  )/32; 
int  laneid  =  threadldx . x%32 ; 
int  t  =  warpid; 

#pragma  unroll  1 

for (  int  y  =  laneid;  y  <  NS;  y  +=  32  ) 

{ 

double  ayy  =  ay[NS*t+y]; 
double  byy  =  by[NS*t+y]; 

#  pragma  unroll  1 

for (  int  x  =  0;  x  <  NS;  x++  ) 

output [NS*NS*t+NS*x+y]  =  ax [NS*t+x] *ayy  +  bx [NS*t+x] *byy ; 

} 

} 
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Cooperative  pattern 


Each  grid  point  handled  by  a  ■single  thread  a  warp. 

_ global _  void  warp_team  (...) 

{ 

int  warpid  =  (  threadldx.x  +  blockldx . x*blockDim. x  )/32; 
int  laneid  =  threadldx . x%32 ; 
int  t  =  warpid; 

#pragma  unroll  1 

for (  int  y  =  laneid;  y  <  NS;  y  +=  32  ) 

{ 

double 
double 
#  pragma  unroll  1 
for (  int  x  =  0;  x  <  NS;  x++  ) 

output [NS*NS*t+NS*x+y]  =  ax [NS*t+x] *ayy  +  bx [NS*t+x] *byy  ; 

} 

} 


ayy  =  ay[NS*t+y] 
Ibyy  =  by[NS*t+y] 


Y-dependent  loads  are  coalesced. 
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Cooperative  pattern 


Each  grid  point  handled 


i  a  warp. 


_ global _  void  warp_team  (...) 

{ 

int  warpid  =  (  threadldx.x  +  blockldx . x*blockDim. x  )/32; 
int  laneid  =  threadldx . x%32 ; 
int  t  =  warpid; 

#pragma  unroll  1 

for (  int  y  =  laneid;  y  <  NS;  y  +=  32  ) 

{ 

double  ayy  ay  [NS  t+y]  ,  y-dependent  loads  are  coalesced. 

double  byy  =  by[NS*t+y]; 

#  pragma  unroll  1 

for (  int  x  =  0;  x  <  NS;  x++  ) 

output [NS*NS*t+NS*x+y]  =  ax [NS*t+x] *ayy  +  bx [NS*t+x] *byy ; 


X-dependent  loads  broadcast  a  value  across  the  warp.  Uncoalesced. 


X-loads  are  uncoalesced... BUT  next  x-iteration  accesses  next  contiguous  location  in  memory... 

AND  effective  cache  per  grid  point  is  now  32X  higher. ..perhaps  the  next  x-load  will  hit? 
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Cooperative  pattern  is  fastest! 


Each  grid  point  handled  by  a  ■single  thread  a  warp. 


_ global _  void  warp_team  (...) 

{ 

int  warpid  =  (  threadldx.x  +  blockldx . x*blockDim. x  )/32; 
int  laneid  =  threadldx . x%32 ; 
int  t  =  warpid; 

#pragma  unroll  1 

for (  int  y  =  laneid;  y  <  NS;  y  +=  32  ) 

{ 

double 


ayy  =  ay[NS*t+y] 
Ibyy  =  by[NS*t+y] 


double 
#  pragma  unroll  1 
for (  int  x  =  0;  x  <  NS;  x++  ) 
output [NS*NS*t+NS*x+y]  = 


Y-dependent  loads  are  coalesced. 


ax [NS*t+x] 


ayy  + 


bx [NS*t+x] 


cbyy  ; 


X-dependent  loads  broadcast  a  value  across  the  warp.  Uncoalesced. 


X-loads  are  uncoalesced. ..BUT  next  x-iteration  accesses  next  contiguous  location  in  memory... 

AND  effective  cache  per  grid  point  is  now  32X  higher. ..perhaps  the  next  x-load  will  hit? 

Nvprof  confirms:  high  hit  rates  =>  fast  kernel!! 

nc_cache_global_hit_rate  =  95.39%,  tex_cache_hit_rate  =  95.39% 


Warp  Team  Runtime 


.135  s 


.052  s 


Naive  Warp  team 
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Downside  to  cooperative:  need  different  memory  layout 


Kernel  with  each  thread  handling  a  grid  point 


_ global _  void  naive  (...) 

{ 


} 


int  t  =  threadldx.x  +  blockldx . x*blockDim. x; 

#pragma  unroll  1 

for (  int  y  =  0;  y  <  NS;  y++  ) 

#  pragma  unroll  1 

for (  int  x  =  0;  x  <  NS;  x++  ) 


output 

ax 

bx 


(NS*y+x)  +t|]  = 


x*N+t 

x*N+t 


|]  *ay 
*by 


|y*N+t| 

y*N+t[ 


+ 


Grid  point  index  t  is  fast  index  for  output ,  ax,  ay,  bx,  by. 

Consecutive  threads  handle  consecutive  grid  points  => 
coalesced  access. 

Corresponds  to  Kokkos:  :LayoutLeft 


Cooperative  kernel 


_ global _  void  warp_team  (...) 

{ 


int  warpid  =  (threadldx.x  +  blockldx . x*blockDim. x) /32 ; 
int  laneid  =  threadldx . x%32 ; 
int  t  =  warpid; 

#pragma  unroll  1 

for (  int  y  =  laneid;  y  <  Ns;  y  +=  32  ) 

{ 


} 


double  ayy  =  ay 
double  byy  =  by 
#  pragma  unroll 
for (  int  x  =  0;  x  <  NS;  x++  ) 


NS*t+y 

NS*t+y 


1 


output  [|NS*NS*t+NS*x+y[]  = 
ax  | 
bx  [ 


[NS 


[NS 


*t+x  i 
*t+x  | 


*ayy  + 
*byy; 


} 


Each  warp  handles  one  grid  point. 

Fast  index  must  be  species  index  x  or  y  (they  are  symmetric) 
for  spatially  local  accesses  by  warps. 

Corresponds  to  Kokkos  :  :  LayoutRight 
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